whoa and Deciding We Could Drop Two of the
Amplicons in a Future Revision of the Baselineif(exists("snakemake")) {
input_list <- snakemake@input
output_list <- snakemake@output
} else {
input_list <- list(
final_baseline = "data/SWFSC-chinook-reference-baseline.csv.gz",
pop_labels = "inputs/reference-collection-names.csv"
)
# outputs:
output_list <- list(
whoa_zs = "results/whoa/heterozygote-z-scores.pdf"
)
}
# you can create the necessary output directories like this:
dump <- lapply(output_list, function(x)
dir.create(dirname(x), recursive = TRUE, showWarnings = FALSE)
)
library(tidyverse)
library(whoa)
# read in the full baseline
full_base <- read_csv(input_list$final_baseline)
# capture that rosa hapstr with some meta data from future use
rosa_genos0 <- full_base %>%
select(indiv:rosa_genotypes_str)
# read the pop labels for proper sorting
pop_labels <- read_csv(input_list$pop_labels)
# rename those collections and repunits
full_base2 <- full_base %>%
rename(old_name = collection) %>%
select(-repunit) %>%
left_join(pop_labels %>% select(old_name, collection, repunit), by = join_by(old_name)) %>%
select(-old_name) %>%
select(indiv, repunit, collection, sample_type, everything()) %>%
filter(!duplicated(indiv)) # necessary cuz clemento stuck mill and deer together into a single code
# finally, get the order of collections and repunits we want
collection_order <- unique(pop_labels$collection)
Get things into long format:
fb_long <- full_base2 %>%
select(-sample_type, -rosa_genotypes_str) %>%
pivot_longer(
cols = -(indiv:collection),
names_to = c("locus", "gene_copy"),
values_to = "allele",
names_pattern = "^(.*)_([12])$"
)
Now, for every allele in the data set we define a new locus—one in which the allele is the 1 allele and everything else is the 0 allele.
For the purposes of passing things off to whoa, we are going to use
as “loki” the combination of locus and
recoding, so we will just paste those with a dot.
Whoa says it wants loci named by Chrom-Pos but I don’t think that is actually necessary, so I will just paste recoding on with a dot to the locus name.
RecodedAlleles <- fb_long %>%
filter(!is.na(allele)) %>% # it is very important to remove missing genotypes
filter(locus != "Ots_coho001_05_32691399") %>% # no reason to have the coho allele
group_by(locus) %>%
mutate(
alle_int = as.integer(factor(allele, levels = sort(unique(allele)))),
alle_cnt = length(levels(factor(allele)))
) %>%
ungroup() %>%
mutate(
boing = map2(.x = alle_int, .y = alle_cnt, .f = function(x, y) {
list(
recoding = 1:y,
new_alle = ifelse(x == 1:y, 1L, 0L)
)
})
) %>%
mutate(
recoding = map(boing, "recoding"),
new_alle = map(boing, "new_alle")
) %>%
select(-boing) %>%
unnest(cols = c(recoding, new_alle)) %>%
#extract(locus, into = c("chrom", "pos"), regex = "Ots_[a-z]+[0-9]+_([0-9]+)_([0-9]+)", remove = FALSE) %>%
#mutate(loki = str_c("Ots", chrom, "_", recoding, "-", pos ), .after = locus) %>%
mutate(loki = str_c(locus, recoding, sep = "."))
#select(-chrom, -pos)
That seemed to go well. In order to be able to get back from the new_alle and recoding to the actual allele, in case we want to, and to verify that things worked, we can collapse this into the different cases:
haplo_integers <- RecodedAlleles %>%
distinct(locus, loki, allele, alle_int, alle_cnt, recoding, new_alle) %>%
arrange(locus, recoding, allele)
haplo_integers
That all looks just right.
Recall, for whoa we want an 012 matrix with missing data denoted by -1.
So, we need to turn these thing back into genotypes for the different individuals. Once I am done with that, I am going to nest the 012 matrices up by repunit and collection
Recoded_012_full_tib <- RecodedAlleles %>%
select(repunit, collection, indiv, loki, new_alle) %>%
group_by(repunit, collection, indiv, loki) %>%
summarise(geno = sum(new_alle)) %>%
ungroup() %>%
pivot_wider(names_from = loki, values_from = geno, values_fill = -1L) %>%
nest(tibble = -c(repunit, collection)) %>%
mutate(
d012 = map(tibble, function(x) x %>% select(-indiv) %>% as.matrix() %>% t())
)
Cool, now we have a list column with a d012 for each collection.
Now we can make a new list column with the whoa output.
tib_with_plots <- Recoded_012_full_tib %>%
mutate(
exp_and_obs = map(d012, function(x) whoa::exp_and_obs_geno_freqs(d012 = x)),
scatter = pmap(
list(eo = exp_and_obs, col = collection, ti = tibble),
function(eo, col, ti) {
whoa::geno_freqs_scatter(eo, max_plot_loci = 1000) +
ggtitle(str_c(col, " (", nrow(ti), ")"))
}
)
)
We can see those all in the notebook like this
tib_with_plots$scatter
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]
##
## [[17]]
There are clearly a few alleles that do not show up in heterozygous form as freuquently as expected. I suspect that these may belong to just one or two loci.
To find which loci are involved, we will simply sort on the z-scores for the heterozygotes.
sorted_losers <- tib_with_plots %>%
select(repunit, collection, exp_and_obs) %>%
unnest(exp_and_obs) %>%
filter(geno == "1") %>%
arrange(z_score)
Looking at that, it is clear that two amplicons are heavily
implicated: Ots_mhap073_19_03686095 and
Ots_mhap093_27_16923862. In the Eel river we also see
Ots_mhap053_11_42403691 pop up once or twice amongst the
most extremely low z-scores for hets, and also we see
Ots_mhap057_13_38673256.2 at Cole Rivers. Eventually with
z-scores above -3 we start to see more loci that are suspect are just
due to random variation.
But, let’s investigate those few loci that we saw in the context of all of the loci. This makes a massive faceted thing that is really only viewable as a PDF, so we just save it.
zs <- sorted_losers %>%
extract(snp, into = "locus", regex = "^(.+)\\.[0-9]+", remove = FALSE) %>%
ggplot(aes(x = z_score)) +
geom_vline(xintercept = 0, color = "red") +
geom_histogram() +
theme_bw() +
facet_wrap(~ locus, ncol = 10)
ggsave(zs, filename = output_list$whoa_zs, width = 20, height = 30)
## Warning: Removed 1946 rows containing non-finite outside the scale range
## (`stat_bin()`).
That provides a lovely perspective on things. mhap073 and mhap093 are clearly pathological. mhap053 and mhap057 show a slight downward trend in the z-scores, but it must be somewhat variable by population, and I don’t think it is enough to toss them out.
mhap073 and mhap093 have 4 and 2 alleles respectively, so tossing
them won’t be a big deal. mhap093 only has two alleles but the are named
things like CCGCCGTTGCT and CTGCCGTTGCT, so
that is pretty messed up anyway.
So, I think we should toss those two loci out.
If we do toss them out, this is what the geno freq plots look like.
tib_with_dropped_plots <- tib_with_plots %>%
mutate(dropped_eo = map(exp_and_obs, function(x) x %>% filter(!str_detect(snp, "Ots_mhap073_19_03686095|Ots_mhap093_27_16923862")))) %>%
mutate(
dropped_scatters = pmap(
list(eo = dropped_eo, col = collection, ti = tibble),
function(eo, col, ti) {
whoa::geno_freqs_scatter(eo, max_plot_loci = 1000) +
ggtitle(str_c(col, " (", nrow(ti), ")"))
}
)
)
tib_with_dropped_plots$dropped_scatters
## [[1]]
##
## [[2]]
##
## [[3]]
##
## [[4]]
##
## [[5]]
##
## [[6]]
##
## [[7]]
##
## [[8]]
##
## [[9]]
##
## [[10]]
##
## [[11]]
##
## [[12]]
##
## [[13]]
##
## [[14]]
##
## [[15]]
##
## [[16]]
##
## [[17]]
So, what we see is that most of the really egregious ones are gone, but in some populations there are a few that look a little wonky, but those are only in a few pops. So, I don’t think we really would be called upon to toss any others in a future revision of the baseline.
sessioninfo::session_info()
## ─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.4.3 (2025-02-28)
## os macOS Monterey 12.7.5
## system aarch64, darwin20
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz America/Denver
## date 2025-03-27
## pandoc 3.6.4 @ /Users/eriq/Documents/git-repos/california-chinook-microhaps/.snakemake/conda/def120e971fb2c87a8c042b4c2c09bdb_/bin/ (via rmarkdown)
## quarto 1.4.550 @ /usr/local/bin/quarto
##
## ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## ! package * version date (UTC) lib source
## P bit 4.6.0 2025-03-06 [?] CRAN (R 4.4.1)
## P bit64 4.6.0-1 2025-01-16 [?] CRAN (R 4.4.1)
## P bslib 0.9.0 2025-01-30 [?] CRAN (R 4.4.1)
## P cachem 1.1.0 2024-05-16 [?] CRAN (R 4.4.1)
## P cli 3.6.4 2025-02-13 [?] CRAN (R 4.4.1)
## P colorspace 2.1-1 2024-07-26 [?] CRAN (R 4.4.1)
## P crayon 1.5.3 2024-06-20 [?] CRAN (R 4.4.1)
## P digest 0.6.37 2024-08-19 [?] CRAN (R 4.4.1)
## P dplyr * 1.1.4 2023-11-17 [?] CRAN (R 4.4.0)
## P evaluate 1.0.3 2025-01-10 [?] CRAN (R 4.4.1)
## P farver 2.1.2 2024-05-13 [?] CRAN (R 4.4.1)
## P fastmap 1.2.0 2024-05-15 [?] CRAN (R 4.4.1)
## P forcats * 1.0.0 2023-01-29 [?] CRAN (R 4.4.0)
## P generics 0.1.3 2022-07-05 [?] CRAN (R 4.4.1)
## P ggplot2 * 3.5.1 2024-04-23 [?] CRAN (R 4.4.0)
## P glue 1.8.0 2024-09-30 [?] CRAN (R 4.4.1)
## P gtable 0.3.6 2024-10-25 [?] CRAN (R 4.4.1)
## P hms 1.1.3 2023-03-21 [?] CRAN (R 4.4.0)
## P htmltools 0.5.8.1 2024-04-04 [?] CRAN (R 4.4.1)
## P jquerylib 0.1.4 2021-04-26 [?] CRAN (R 4.4.0)
## P jsonlite 1.9.1 2025-03-03 [?] CRAN (R 4.4.1)
## P knitr 1.49 2024-11-08 [?] CRAN (R 4.4.1)
## P labeling 0.4.3 2023-08-29 [?] CRAN (R 4.4.1)
## P lifecycle 1.0.4 2023-11-07 [?] CRAN (R 4.4.1)
## P lubridate * 1.9.4 2024-12-08 [?] CRAN (R 4.4.1)
## P magrittr 2.0.3 2022-03-30 [?] CRAN (R 4.4.1)
## P munsell 0.5.1 2024-04-01 [?] CRAN (R 4.4.1)
## P pillar 1.10.1 2025-01-07 [?] CRAN (R 4.4.1)
## P pkgconfig 2.0.3 2019-09-22 [?] CRAN (R 4.4.1)
## P purrr * 1.0.4 2025-02-05 [?] CRAN (R 4.4.1)
## P R6 2.6.1 2025-02-15 [?] CRAN (R 4.4.1)
## P ragg 1.3.3 2024-09-11 [?] CRAN (R 4.4.1)
## P Rcpp 1.0.14 2025-01-12 [?] CRAN (R 4.4.1)
## P readr * 2.1.5 2024-01-10 [?] CRAN (R 4.4.0)
## renv 1.1.4 2025-03-20 [1] CRAN (R 4.4.1)
## P rlang 1.1.5 2025-01-17 [?] CRAN (R 4.4.1)
## P rmarkdown 2.29 2024-11-04 [?] CRAN (R 4.4.1)
## P sass 0.4.9 2024-03-15 [?] CRAN (R 4.4.0)
## P scales 1.3.0 2023-11-28 [?] CRAN (R 4.4.0)
## P sessioninfo 1.2.3 2025-02-05 [?] CRAN (R 4.4.1)
## P stringi 1.8.4 2024-05-06 [?] CRAN (R 4.4.1)
## P stringr * 1.5.1 2023-11-14 [?] CRAN (R 4.4.0)
## P systemfonts 1.2.1 2025-01-20 [?] CRAN (R 4.4.1)
## P textshaping 1.0.0 2025-01-20 [?] CRAN (R 4.4.1)
## P tibble * 3.2.1 2023-03-20 [?] CRAN (R 4.4.0)
## P tidyr * 1.3.1 2024-01-24 [?] CRAN (R 4.4.1)
## P tidyselect 1.2.1 2024-03-11 [?] CRAN (R 4.4.0)
## P tidyverse * 2.0.0 2023-02-22 [?] CRAN (R 4.4.0)
## P timechange 0.3.0 2024-01-18 [?] CRAN (R 4.4.1)
## P tzdb 0.4.0 2023-05-12 [?] CRAN (R 4.4.0)
## P vctrs 0.6.5 2023-12-01 [?] CRAN (R 4.4.0)
## P vroom 1.6.5 2023-12-05 [?] CRAN (R 4.4.0)
## P whoa * 0.0.2 2021-08-11 [?] RSPM
## P withr 3.0.2 2024-10-28 [?] CRAN (R 4.4.1)
## P xfun 0.51 2025-02-19 [?] CRAN (R 4.4.1)
## P yaml 2.3.10 2024-07-26 [?] CRAN (R 4.4.1)
##
## [1] /Users/eriq/Documents/git-repos/california-chinook-microhaps/renv/library/macos/R-4.4/aarch64-apple-darwin20
## [2] /Users/eriq/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/aarch64-apple-darwin20/f7156815
##
## * ── Packages attached to the search path.
## P ── Loaded and on-disk path mismatch.
##
## ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Running the code and rendering this notebook required approximately this much time:
Sys.time() - start_time
## Time difference of 20.69929 secs